Data Set Information

The Wisconsin Breast Cancer data set used in this paper was collected at the University of Wisconsin Hospitals, Madison, by Dr. William H. Wolberg. It contains 699 instances with 10 attributes [17]. The class distribution is framed as benign and malignant. There is 1 dependent variable and 9 independent variables. The independent variables take values from 1 to 10, and the class variable is 2 for a benign and 4 for a malignant tumor. For the independent variables, a value of 1 represents the lowest likelihood of breast cancer and a value of 10 the highest.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A more detailed description can be found here.

The separating plane described above was obtained using the Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree Construction Via Linear Programming." Proceedings of the 4th Midwest Artificial Intelligence and Cognitive Science Society, pp. 97-101, 1992], a classification method which uses linear programming to construct a decision tree. Relevant features were selected using an exhaustive search in the space of 1-4 features and 1-3 separating planes.

The actual linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34].

Attributes Information

| Attribute Name | Description | Category | Range Values |
|---|---|---|---|
| SCNo | Sample Code Number | Id | - |
| CT | Clump Thickness | Ordinal | 1-10 |
| UCSIZE | Uniformity of Cell Size | Ordinal | 1-10 |
| UCSHAPE | Uniformity of Cell Shape | Ordinal | 1-10 |
| MAF | Marginal Adhesion Fibrous: Fibrous bands tissue that form between two surfaces | Ordinal | 1-10 |
| ECSIZE | Epithelial Cell Size: Size of a single cell that forms tissues that lines the outside of the body and the passageways that lead to or from the surface | Ordinal | 1-10 |
| BN | Bare Nuclei | Ordinal | 1-10 |
| BC | Bland Chromatin: Evaluates for the presence of Bare bodies | Ordinal | 1-10 |
| NN | Normal Nucleoli | Ordinal | 1-10 |
| M | Mitoses: Cell growth | Ordinal | 1-10 |
| Dia | Diagnosis of tumours | Class | 2, 4 |

Data View

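library(dplyr)       # %>%, select(), mutate()
library(kableExtra)  # kbl(), kable_paper(), scroll_box()

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"
# The file has no header row, so assign the column headings here
data = read.csv(file = url, header = FALSE,
                 col.names = c("SCNo","CT", "UCSize", "UCShape", "MAF", "ECSize", "BN", "BC", "NN","M", "Dia") )

# Drop the sample ID, add a 0/1 response (1 = malignant) and convert BN to integer
data = data %>% select(-SCNo) %>%
  mutate(
  Response = as.integer(ifelse(Dia == 4, 1, 0)),
  BN = strtoi(BN)  # non-numeric entries in BN become NA
)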

kbl(data) %>%
  kable_paper() %>%
  scroll_box(width = "800px", height = "300px")
CT UCSize UCShape MAF ECSize BN BC NN M Dia Response
5 1 1 1 2 1 3 1 1 2 0
5 4 4 5 7 10 3 2 1 2 0
3 1 1 1 2 2 3 1 1 2 0
6 8 8 1 3 4 3 7 1 2 0
4 1 1 3 2 1 3 1 1 2 0
8 10 10 8 7 10 9 7 1 4 1
1 1 1 1 2 10 3 1 1 2 0
2 1 2 1 2 1 3 1 1 2 0
2 1 1 1 2 1 1 1 5 2 0
4 2 1 1 2 1 2 1 1 2 0
1 1 1 1 1 1 3 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
5 3 3 3 2 3 4 4 1 4 1
1 1 1 1 2 3 3 1 1 2 0
8 7 5 10 7 9 5 5 4 4 1
7 4 6 4 6 1 4 3 1 4 1
4 1 1 1 2 1 2 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
10 7 7 6 4 10 4 1 2 4 1
6 1 1 1 2 1 3 1 1 2 0
7 3 2 10 5 10 5 4 4 4 1
10 5 5 3 6 7 7 10 1 4 1
3 1 1 1 2 1 2 1 1 2 0
8 4 5 1 2 NA 7 3 1 4 1
1 1 1 1 2 1 3 1 1 2 0
5 2 3 4 2 7 3 6 1 4 1
3 2 1 1 1 1 2 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
1 1 3 1 2 1 1 1 1 2 0
3 1 1 1 1 1 2 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
10 7 7 3 8 5 7 4 3 4 1
2 1 1 2 2 1 3 1 1 2 0
3 1 2 1 2 1 2 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
10 10 10 8 6 1 8 9 1 4 1
6 2 1 1 1 1 7 1 1 2 0
5 4 4 9 2 10 5 6 1 4 1
2 5 3 3 6 7 7 5 1 4 1
6 6 6 9 6 NA 7 8 1 2 0
10 4 3 1 3 3 6 5 2 4 1
6 10 10 2 8 10 7 3 3 4 1
5 6 5 6 10 1 3 1 1 4 1
10 10 10 4 8 1 8 10 1 4 1
1 1 1 1 2 1 2 1 2 2 0
3 7 7 4 4 9 4 8 1 4 1
1 1 1 1 2 1 2 1 1 2 0
4 1 1 3 2 1 3 1 1 2 0
7 8 7 2 4 8 3 8 2 4 1
9 5 8 1 2 3 2 1 5 4 1
5 3 3 4 2 4 3 4 1 4 1
10 3 6 2 3 5 4 10 2 4 1
5 5 5 8 10 8 7 3 7 4 1
10 5 5 6 8 8 7 1 1 4 1
10 6 6 3 4 5 3 6 1 4 1
8 10 10 1 3 6 3 9 1 4 1
8 2 4 1 5 1 5 4 4 4 1
5 2 3 1 6 10 5 1 1 4 1
9 5 5 2 2 2 5 1 1 4 1
5 3 5 5 3 3 4 10 1 4 1
1 1 1 1 2 2 2 1 1 2 0
9 10 10 1 10 8 3 3 1 4 1
6 3 4 1 5 2 3 9 1 4 1
1 1 1 1 2 1 2 1 1 2 0
10 4 2 1 3 2 4 3 10 4 1
4 1 1 1 2 1 3 1 1 2 0
5 3 4 1 8 10 4 9 1 4 1
8 3 8 3 4 9 8 9 8 4 1
1 1 1 1 2 1 3 2 1 2 0
5 1 3 1 2 1 2 1 1 2 0
6 10 2 8 10 2 7 8 10 4 1
1 3 3 2 2 1 7 2 1 2 0
9 4 5 10 6 10 4 8 1 4 1
10 6 4 1 3 4 3 2 3 4 1
1 1 2 1 2 2 4 2 1 2 0
1 1 4 1 2 1 2 1 1 2 0
5 3 1 2 2 1 2 1 1 2 0
3 1 1 1 2 3 3 1 1 2 0
2 1 1 1 3 1 2 1 1 2 0
2 2 2 1 1 1 7 1 1 2 0
4 1 1 2 2 1 2 1 1 2 0
5 2 1 1 2 1 3 1 1 2 0
3 1 1 1 2 2 7 1 1 2 0
3 5 7 8 8 9 7 10 7 4 1
5 10 6 1 10 4 4 10 10 4 1
3 3 6 4 5 8 4 4 1 4 1
3 6 6 6 5 10 6 8 3 4 1
4 1 1 1 2 1 3 1 1 2 0
2 1 1 2 3 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
3 1 1 2 2 1 1 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
2 1 1 2 2 1 1 1 1 2 0
5 1 1 1 2 1 3 1 1 2 0
9 6 9 2 10 6 2 9 10 4 1
7 5 6 10 5 10 7 9 4 4 1
10 3 5 1 10 5 3 10 2 4 1
2 3 4 4 2 5 2 5 1 4 1
4 1 2 1 2 1 3 1 1 2 0
8 2 3 1 6 3 7 1 1 4 1
10 10 10 10 10 1 8 8 8 4 1
7 3 4 4 3 3 3 2 7 4 1
10 10 10 8 2 10 4 1 1 4 1
1 6 8 10 8 10 5 7 1 4 1
1 1 1 1 2 1 2 3 1 2 0
6 5 4 4 3 9 7 8 3 4 1
1 3 1 2 2 2 5 3 2 2 0
8 6 4 3 5 9 3 1 1 4 1
10 3 3 10 2 10 7 3 3 4 1
10 10 10 3 10 8 8 1 1 4 1
3 3 2 1 2 3 3 1 1 2 0
1 1 1 1 2 5 1 1 1 2 0
8 3 3 1 2 2 3 2 1 2 0
4 5 5 10 4 10 7 5 8 4 1
1 1 1 1 4 3 1 1 1 2 0
3 2 1 1 2 2 3 1 1 2 0
1 1 2 2 2 1 3 1 1 2 0
4 2 1 1 2 2 3 1 1 2 0
10 10 10 2 10 10 5 3 3 4 1
5 3 5 1 8 10 5 3 1 4 1
5 4 6 7 9 7 8 10 1 4 1
1 1 1 1 2 1 2 1 1 2 0
7 5 3 7 4 10 7 5 5 4 1
3 1 1 1 2 1 3 1 1 2 0
8 3 5 4 5 10 1 6 2 4 1
1 1 1 1 10 1 1 1 1 2 0
5 1 3 1 2 1 2 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
5 10 8 10 8 10 3 6 3 4 1
3 1 1 1 2 1 2 2 1 2 0
3 1 1 1 3 1 2 1 1 2 0
5 1 1 1 2 2 3 3 1 2 0
4 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
4 1 2 1 2 1 2 1 1 2 0
1 1 1 1 1 NA 2 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
9 5 5 4 4 5 4 3 3 4 1
1 1 1 1 2 5 1 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
1 1 3 1 2 NA 2 1 1 2 0
3 4 5 2 6 8 4 1 1 4 1
1 1 1 1 3 2 2 1 1 2 0
3 1 1 3 8 1 5 8 1 2 0
8 8 7 4 10 10 7 8 7 4 1
1 1 1 1 1 1 3 1 1 2 0
7 2 4 1 6 10 5 4 3 4 1
10 10 8 6 4 5 8 10 1 4 1
4 1 1 1 2 3 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 5 5 6 3 10 3 1 1 4 1
1 2 2 1 2 1 2 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
1 1 2 1 3 NA 1 1 1 2 0
9 9 10 3 6 10 7 10 6 4 1
10 7 7 4 5 10 5 7 2 4 1
4 1 1 1 2 1 3 2 1 2 0
3 1 1 1 2 1 3 1 1 2 0
1 1 1 2 1 3 1 1 7 2 0
5 1 1 1 2 NA 3 1 1 2 0
4 1 1 1 2 2 3 2 1 2 0
5 6 7 8 8 10 3 10 3 4 1
10 8 10 10 6 1 3 1 10 4 1
3 1 1 1 2 1 3 1 1 2 0
1 1 1 2 1 1 1 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
6 10 10 10 8 10 10 10 7 4 1
8 6 5 4 3 10 6 1 1 4 1
5 8 7 7 10 10 5 7 1 4 1
2 1 1 1 2 1 3 1 1 2 0
5 10 10 3 8 1 5 10 3 4 1
4 1 1 1 2 1 3 1 1 2 0
5 3 3 3 6 10 3 1 1 4 1
1 1 1 1 1 1 3 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
6 1 1 1 2 1 3 1 1 2 0
5 8 8 8 5 10 7 8 1 4 1
8 7 6 4 4 10 5 1 1 4 1
2 1 1 1 1 1 3 1 1 2 0
1 5 8 6 5 8 7 10 1 4 1
10 5 6 10 6 10 7 7 10 4 1
5 8 4 10 5 8 9 10 1 4 1
1 2 3 1 2 1 3 1 1 2 0
10 10 10 8 6 8 7 10 1 4 1
7 5 10 10 10 10 4 10 3 4 1
5 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
3 1 1 1 2 1 3 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
8 4 4 5 4 7 7 8 2 2 0
5 1 1 4 2 1 3 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
9 7 7 5 5 10 7 8 3 4 1
10 8 8 4 10 10 8 1 1 4 1
1 1 1 1 2 1 3 1 1 2 0
5 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
5 10 10 9 6 10 7 10 5 4 1
10 10 9 3 7 5 3 5 1 4 1
1 1 1 1 1 1 3 1 1 2 0
1 1 1 1 1 1 3 1 1 2 0
5 1 1 1 1 1 3 1 1 2 0
8 10 10 10 5 10 8 10 6 4 1
8 10 8 8 4 8 7 7 1 4 1
1 1 1 1 2 1 3 1 1 2 0
10 10 10 10 7 10 7 10 4 4 1
10 10 10 10 3 10 10 6 1 4 1
8 7 8 7 5 5 5 10 2 4 1
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
6 10 7 7 6 4 8 10 2 4 1
6 1 3 1 2 1 3 1 1 2 0
1 1 1 2 2 1 3 1 1 2 0
10 6 4 3 10 10 9 10 1 4 1
4 1 1 3 1 5 2 1 1 4 1
7 5 6 3 3 8 7 4 1 4 1
10 5 5 6 3 10 7 9 2 4 1
1 1 1 1 2 1 2 1 1 2 0
10 5 7 4 4 10 8 9 1 4 1
8 9 9 5 3 5 7 7 1 4 1
1 1 1 1 1 1 3 1 1 2 0
10 10 10 3 10 10 9 10 1 4 1
7 4 7 4 3 7 7 6 1 4 1
6 8 7 5 6 8 8 9 2 4 1
8 4 6 3 3 1 4 3 1 2 0
10 4 5 5 5 10 4 1 1 4 1
3 3 2 1 3 1 3 6 1 2 0
3 1 4 1 2 NA 3 1 1 2 0
10 8 8 2 8 10 4 8 10 4 1
9 8 8 5 6 2 4 10 4 4 1
8 10 10 8 6 9 3 10 10 4 1
10 4 3 2 3 10 5 3 2 4 1
5 1 3 3 2 2 2 3 1 2 0
3 1 1 3 1 1 3 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 5 5 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
5 1 1 2 2 2 3 1 1 2 0
8 10 10 8 5 10 7 8 1 4 1
8 4 4 1 2 9 3 3 1 4 1
4 1 1 1 2 1 3 6 1 2 0
3 1 1 1 2 NA 3 1 1 2 0
1 2 2 1 2 1 1 1 1 2 0
10 4 4 10 2 10 5 3 3 4 1
6 3 3 5 3 10 3 5 3 2 0
6 10 10 2 8 10 7 3 3 4 1
9 10 10 1 10 8 3 3 1 4 1
5 6 6 2 4 10 3 6 1 4 1
3 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 3 1 1 2 0
5 7 7 1 5 8 3 4 1 2 0
10 5 8 10 3 10 5 1 3 4 1
5 10 10 6 10 10 10 6 5 4 1
8 8 9 4 5 10 7 8 1 4 1
10 4 4 10 6 10 5 5 1 4 1
7 9 4 10 10 3 5 3 3 4 1
5 1 4 1 2 1 3 2 1 2 0
10 10 6 3 3 10 4 3 2 4 1
3 3 5 2 3 10 7 1 1 4 1
10 8 8 2 3 4 8 7 8 4 1
1 1 1 1 2 1 3 1 1 2 0
8 4 7 1 3 10 3 9 2 4 1
5 1 1 1 2 1 3 1 1 2 0
3 3 5 2 3 10 7 1 1 4 1
7 2 4 1 3 4 3 3 1 4 1
3 1 1 1 2 1 3 2 1 2 0
3 1 3 1 2 NA 2 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
10 5 7 3 3 7 3 3 8 4 1
3 1 1 1 2 1 3 1 1 2 0
2 1 1 2 2 1 3 1 1 2 0
1 4 3 10 4 10 5 6 1 4 1
10 4 6 1 2 10 5 3 1 4 1
7 4 5 10 2 10 3 8 2 4 1
8 10 10 10 8 10 10 7 3 4 1
10 10 10 10 10 10 4 10 10 4 1
3 1 1 1 3 1 2 1 1 2 0
6 1 3 1 4 5 5 10 1 4 1
5 6 6 8 6 10 4 10 4 4 1
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
8 8 8 1 2 NA 6 10 1 4 1
10 4 4 6 2 10 2 3 1 4 1
1 1 1 1 2 NA 2 1 1 2 0
5 5 7 8 6 10 7 4 1 4 1
5 3 4 3 4 5 4 7 1 2 0
5 4 3 1 2 NA 2 3 1 2 0
8 2 1 1 5 1 1 1 1 2 0
9 1 2 6 4 10 7 7 2 4 1
8 4 10 5 4 4 7 10 1 4 1
1 1 1 1 2 1 3 1 1 2 0
10 10 10 7 9 10 7 10 10 4 1
1 1 1 1 2 1 3 1 1 2 0
8 3 4 9 3 10 3 3 1 4 1
10 8 4 4 4 10 3 10 4 4 1
1 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
7 8 7 6 4 3 8 8 4 4 1
3 1 1 1 2 5 5 1 1 2 0
2 1 1 1 3 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
8 6 4 10 10 1 3 5 1 4 1
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 1 1 2 1 1 2 0
4 6 5 6 7 NA 4 9 1 2 0
5 5 5 2 5 10 4 3 1 4 1
6 8 7 8 6 8 8 9 1 4 1
1 1 1 1 5 1 3 1 1 2 0
4 4 4 4 6 5 7 3 1 2 0
7 6 3 2 5 10 7 4 6 4 1
3 1 1 1 2 NA 3 1 1 2 0
3 1 1 1 2 1 3 1 1 2 0
5 4 6 10 2 10 4 1 1 4 1
1 1 1 1 2 1 3 1 1 2 0
3 2 2 1 2 1 2 3 1 2 0
10 1 1 1 2 10 5 4 1 4 1
1 1 1 1 2 1 2 1 1 2 0
8 10 3 2 6 4 3 10 1 4 1
10 4 6 4 5 10 7 1 1 4 1
10 4 7 2 2 8 6 1 1 4 1
5 1 1 1 2 1 3 1 2 2 0
5 2 2 2 2 1 2 2 1 2 0
5 4 6 6 4 10 4 3 1 4 1
8 6 7 3 3 10 3 4 2 4 1
1 1 1 1 2 1 1 1 1 2 0
6 5 5 8 4 10 3 4 1 4 1
1 1 1 1 2 1 3 1 1 2 0
1 1 1 1 1 1 2 1 1 2 0
8 5 5 5 2 10 4 3 1 4 1
10 3 3 1 2 10 7 6 1 4 1
1 1 1 1 2 1 3 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
7 6 4 8 10 10 9 5 3 4 1
1 1 1 1 2 1 1 1 1 2 0
5 2 2 2 3 1 1 3 1 2 0
1 1 1 1 1 1 1 3 1 2 0
3 4 4 10 5 1 3 3 1 4 1
4 2 3 5 3 8 7 6 1 4 1
5 1 1 3 2 1 1 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
3 4 5 3 7 3 4 6 1 2 0
2 7 10 10 7 10 4 9 4 4 1
1 1 1 1 2 1 2 1 1 2 0
4 1 1 1 3 1 2 2 1 2 0
5 3 3 1 3 3 3 3 3 4 1
8 10 10 7 10 10 7 3 8 4 1
8 10 5 3 8 4 4 10 3 4 1
10 3 5 4 3 7 3 5 3 4 1
6 10 10 10 10 10 8 10 10 4 1
3 10 3 10 6 10 5 1 4 4 1
3 2 2 1 4 3 2 1 1 2 0
4 4 4 2 2 3 2 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
6 10 10 10 8 10 7 10 7 4 1
5 8 8 10 5 10 8 10 3 4 1
1 1 3 1 2 1 1 1 1 2 0
1 1 3 1 1 1 2 1 1 2 0
4 3 2 1 3 1 2 1 1 2 0
1 1 3 1 2 1 1 1 1 2 0
4 1 2 1 2 1 2 1 1 2 0
5 1 1 2 2 1 2 1 1 2 0
3 1 2 1 2 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 1 1 2 1 1 2 0
3 1 1 4 3 1 2 2 1 2 0
5 3 4 1 4 1 3 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
10 6 3 6 4 10 7 8 4 4 1
3 2 2 2 2 1 3 2 1 2 0
2 1 1 1 2 1 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
3 3 2 2 3 1 1 2 3 2 0
7 6 6 3 2 10 7 1 1 4 1
5 3 3 2 3 1 3 1 1 2 0
2 1 1 1 2 1 2 2 1 2 0
5 1 1 1 3 2 2 2 1 2 0
1 1 1 2 2 1 2 1 1 2 0
10 8 7 4 3 10 7 9 1 4 1
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 1 1 1 1 1 2 0
1 2 3 1 2 1 2 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 3 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
3 2 1 1 2 1 2 2 1 2 0
1 2 3 1 2 1 1 1 1 2 0
3 10 8 7 6 9 9 3 8 4 1
3 1 1 1 2 1 1 1 1 2 0
5 3 3 1 2 1 2 1 1 2 0
3 1 1 1 2 4 1 1 1 2 0
1 2 1 3 2 1 1 2 1 2 0
1 1 1 1 2 1 2 1 1 2 0
4 2 2 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
2 3 2 2 2 2 3 1 1 2 0
3 1 2 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 1 NA 2 1 1 2 0
10 10 10 6 8 4 8 5 1 4 1
5 1 2 1 2 1 3 1 1 2 0
8 5 6 2 3 10 6 6 1 4 1
3 3 2 6 3 3 3 5 1 2 0
8 7 8 5 10 10 7 2 1 4 1
1 1 1 1 2 1 2 1 1 2 0
5 2 2 2 2 2 3 2 2 2 0
2 3 1 1 5 1 1 1 1 2 0
3 2 2 3 2 3 3 1 1 2 0
10 10 10 7 10 10 8 2 1 4 1
4 3 3 1 2 1 3 3 1 2 0
5 1 3 1 2 1 2 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
9 10 10 10 10 10 10 10 1 4 1
5 3 6 1 2 1 1 1 1 2 0
8 7 8 2 4 2 5 10 1 4 1
1 1 1 1 2 1 2 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
1 3 1 1 2 1 2 2 1 2 0
5 1 1 3 4 1 3 2 1 2 0
5 1 1 1 2 1 2 2 1 2 0
3 2 2 3 2 1 1 1 1 2 0
6 9 7 5 5 8 4 2 1 2 0
10 8 10 1 3 10 5 1 1 4 1
10 10 10 1 6 1 2 8 1 4 1
4 1 1 1 2 1 1 1 1 2 0
4 1 3 3 2 1 1 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
10 4 3 10 4 10 10 1 1 4 1
5 2 2 4 2 4 1 1 1 2 0
1 1 1 3 2 3 1 1 1 2 0
1 1 1 1 2 2 1 1 1 2 0
5 1 1 6 3 1 2 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
1 1 1 1 1 1 1 1 1 2 0
5 7 9 8 6 10 8 10 1 4 1
4 1 1 3 1 1 2 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
3 1 1 3 2 1 1 1 1 2 0
4 5 5 8 6 10 10 7 1 4 1
2 3 1 1 3 1 1 1 1 2 0
10 2 2 1 2 6 1 1 2 4 1
10 6 5 8 5 10 8 6 1 4 1
8 8 9 6 6 3 10 10 1 4 1
5 1 2 1 2 1 1 1 1 2 0
5 1 3 1 2 1 1 1 1 2 0
5 1 1 3 2 1 1 1 1 2 0
3 1 1 1 2 5 1 1 1 2 0
6 1 1 3 2 1 1 1 1 2 0
4 1 1 1 2 1 1 2 1 2 0
4 1 1 1 2 1 1 1 1 2 0
10 9 8 7 6 4 7 10 3 4 1
10 6 6 2 4 10 9 7 1 4 1
6 6 6 5 4 10 7 6 2 4 1
4 1 1 1 2 1 1 1 1 2 0
1 1 2 1 2 1 2 1 1 2 0
3 1 1 1 1 1 2 1 1 2 0
6 1 1 3 2 1 1 1 1 2 0
6 1 1 1 1 1 1 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
4 1 2 1 2 1 1 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
5 2 1 1 2 1 1 1 1 2 0
4 8 7 10 4 10 7 5 1 4 1
5 1 1 1 1 1 1 1 1 2 0
5 3 2 4 2 1 1 1 1 2 0
9 10 10 10 10 5 10 10 10 4 1
8 7 8 5 5 10 9 10 1 4 1
5 1 2 1 2 1 1 1 1 2 0
1 1 1 3 1 3 1 1 1 2 0
3 1 1 1 1 1 2 1 1 2 0
10 10 10 10 6 10 8 1 5 4 1
3 6 4 10 3 3 3 4 1 4 1
6 3 2 1 3 4 4 1 1 4 1
1 1 1 1 2 1 1 1 1 2 0
5 8 9 4 3 10 7 1 1 4 1
4 1 1 1 1 1 2 1 1 2 0
5 10 10 10 6 10 6 5 2 4 1
5 1 2 10 4 5 2 1 1 2 0
3 1 1 1 1 1 2 1 1 2 0
1 1 1 1 1 1 1 1 1 2 0
4 2 1 1 2 1 1 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
6 1 1 1 2 1 3 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
4 1 1 2 2 1 2 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
3 3 1 1 2 1 1 1 1 2 0
8 10 10 10 7 5 4 8 7 4 1
1 1 1 1 2 4 1 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
3 1 1 1 1 1 2 1 1 2 0
6 6 7 10 3 10 8 10 2 4 1
4 10 4 7 3 10 9 10 1 4 1
1 1 1 1 1 1 1 1 1 2 0
1 1 1 1 1 1 2 1 1 2 0
3 1 2 2 2 1 1 1 1 2 0
4 7 8 3 4 10 9 1 1 4 1
1 1 1 1 3 1 1 1 1 2 0
4 1 1 1 3 1 1 1 1 2 0
10 4 5 4 3 5 7 3 1 4 1
7 5 6 10 4 10 5 3 1 4 1
3 1 1 1 2 1 2 1 1 2 0
3 1 1 2 2 1 1 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
6 1 3 2 2 1 1 1 1 2 0
4 1 1 1 1 1 2 1 1 2 0
7 4 4 3 4 10 6 9 1 4 1
4 2 2 1 2 1 2 1 1 2 0
1 1 1 1 1 1 3 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
1 1 3 2 2 1 3 1 1 2 0
5 1 1 1 2 1 3 1 1 2 0
5 1 2 1 2 1 3 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
6 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 2 2 1 1 2 0
3 1 1 1 2 1 1 1 1 2 0
5 3 1 1 2 1 1 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
2 1 3 2 2 1 2 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
6 10 10 10 4 10 7 10 1 4 1
2 1 1 1 1 1 1 1 1 2 0
3 1 1 1 1 1 1 1 1 2 0
7 8 3 7 4 5 7 8 2 4 1
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
3 2 2 2 2 1 4 2 1 2 0
4 4 2 1 2 5 2 1 2 2 0
3 1 1 1 2 1 1 1 1 2 0
4 3 1 1 2 1 4 8 1 2 0
5 2 2 2 1 1 2 1 1 2 0
5 1 1 3 2 1 1 1 1 2 0
2 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 1 3 1 1 2 0
5 1 1 1 2 1 3 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
4 1 1 1 2 1 3 2 1 2 0
5 7 10 10 5 10 10 10 1 4 1
3 1 2 1 2 1 3 1 1 2 0
4 1 1 1 2 3 2 1 1 2 0
8 4 4 1 6 10 2 5 2 4 1
10 10 8 10 6 5 10 3 1 4 1
8 10 4 4 8 10 8 2 1 4 1
7 6 10 5 3 10 9 10 2 4 1
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
10 9 7 3 4 2 7 7 1 4 1
5 1 2 1 2 1 3 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 3 1 1 2 0
5 1 2 1 2 1 2 1 1 2 0
5 7 10 6 5 10 7 5 1 4 1
6 10 5 5 4 10 6 10 1 4 1
3 1 1 1 2 1 1 1 1 2 0
5 1 1 6 3 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
8 10 10 10 6 10 10 10 1 4 1
5 1 1 1 2 1 2 2 1 2 0
9 8 8 9 6 3 4 1 1 4 1
5 1 1 1 2 1 1 1 1 2 0
4 10 8 5 4 1 10 1 1 4 1
2 5 7 6 4 10 7 6 1 4 1
10 3 4 5 3 10 4 1 1 4 1
5 1 2 1 2 1 1 1 1 2 0
4 8 6 3 4 10 7 1 1 4 1
5 1 1 1 2 1 2 1 1 2 0
4 1 2 1 2 1 2 1 1 2 0
5 1 3 1 2 1 3 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
5 2 4 1 1 1 1 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 1 1 2 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
5 4 6 8 4 1 8 10 1 4 1
5 3 2 8 5 10 8 1 2 4 1
10 5 10 3 5 8 7 8 3 4 1
4 1 1 2 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 10 10 10 10 10 10 1 1 4 1
5 1 1 1 2 1 1 1 1 2 0
10 4 3 10 3 10 7 1 2 4 1
5 10 10 10 5 2 8 5 1 4 1
8 10 10 10 6 10 10 10 10 4 1
2 3 1 1 2 1 2 1 1 2 0
2 1 1 1 1 1 2 1 1 2 0
4 1 3 1 2 1 2 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 1 NA 1 1 1 2 0
4 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
6 3 3 3 3 2 6 1 1 2 0
7 1 2 3 2 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 1 1 2 1 1 2 1 1 2 0
3 1 3 1 3 4 1 1 1 2 0
4 6 6 5 7 6 7 7 3 4 1
2 1 1 1 2 5 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
6 2 3 1 2 1 1 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
8 7 4 4 5 3 5 10 1 4 1
3 1 1 1 2 1 1 1 1 2 0
3 1 4 1 2 1 1 1 1 2 0
10 10 7 8 7 1 10 10 3 4 1
4 2 4 3 2 2 2 1 1 2 0
4 1 1 1 2 1 1 1 1 2 0
5 1 1 3 2 1 1 1 1 2 0
4 1 1 3 2 1 1 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
1 2 2 1 2 1 1 1 1 2 0
1 1 1 3 2 1 1 1 1 2 0
5 10 10 10 10 2 10 10 10 4 1
3 1 1 1 2 1 2 1 1 2 0
3 1 1 2 3 4 1 1 1 2 0
1 2 1 3 2 1 2 1 1 2 0
5 1 1 1 2 1 2 2 1 2 0
4 1 1 1 2 1 2 1 1 2 0
3 1 1 1 2 1 3 1 1 2 0
3 1 1 1 2 1 2 1 1 2 0
5 1 1 1 2 1 2 1 1 2 0
5 4 5 1 8 1 3 6 1 2 0
7 8 8 7 3 10 7 2 3 4 1
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
4 1 1 1 2 1 3 1 1 2 0
1 1 3 1 2 1 2 1 1 2 0
1 1 3 1 2 1 2 1 1 2 0
3 1 1 3 2 1 2 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
5 2 2 2 2 1 1 1 2 2 0
3 1 1 1 2 1 3 1 1 2 0
5 7 4 1 6 1 7 10 3 4 1
5 10 10 8 5 5 7 10 1 4 1
3 10 7 8 5 8 7 4 1 4 1
3 2 1 2 2 1 3 1 1 2 0
2 1 1 1 2 1 3 1 1 2 0
5 3 2 1 3 1 1 1 1 2 0
1 1 1 1 2 1 2 1 1 2 0
4 1 4 1 2 1 1 1 1 2 0
1 1 2 1 2 1 2 1 1 2 0
5 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
10 10 10 10 5 10 10 10 7 4 1
5 10 10 10 4 10 5 6 3 4 1
5 1 1 1 2 1 3 2 1 2 0
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 2 3 1 2 0
4 1 1 1 2 1 1 1 1 2 0
1 1 1 1 2 1 1 1 8 2 0
1 1 1 3 2 1 1 1 1 2 0
5 10 10 5 4 5 4 4 1 4 1
3 1 1 1 2 1 1 1 1 2 0
3 1 1 1 2 1 2 1 2 2 0
3 1 1 1 3 2 1 1 1 2 0
2 1 1 1 2 1 1 1 1 2 0
5 10 10 3 7 3 8 10 2 4 1
4 8 6 4 3 4 10 6 1 4 1
4 8 8 5 4 5 10 4 1 4 1

Multivariate Data Analysis

There are 16 missing values in the Bare Nuclei predictor. Most of them (14 observations) fall in response category 0. Since the data set has 458 cases of category 0 and 241 of category 1, removing these 16 observations costs little in either class, so I drop them from the dataset.

data2 = data[!is.na(data$BN),]
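
Before dropping rows, it can help to check how the missing values are spread across the two classes. The following is a minimal sketch on a small made-up data frame; the `toy` object and its values are illustrative assumptions, not rows from the data set. The same two steps could be applied to `data`.

```r
# Toy stand-in for the real data: BN has missing values, Response is 0/1.
# (Illustrative values only; not taken from the Wisconsin data set.)
toy <- data.frame(
  BN       = c(1L, NA, 10L, NA, 3L, NA),
  Response = c(0L, 0L, 1L, 0L, 1L, 1L)
)

# Cross-tabulate class against "BN is missing"
na_by_class <- table(Response = toy$Response, MissingBN = is.na(toy$BN))
print(na_by_class)

# Keep only rows with an observed BN value (same filter as used for data2)
toy_complete <- toy[!is.na(toy$BN), ]
print(nrow(toy_complete))
```

A cross-tabulation like this shows directly whether the removal hits one class disproportionately.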

Basic Stats

data2$ClassCat[data2$Dia == 4] = 'Malignant'
data2$ClassCat[data2$Dia == 2] = 'Benign'

Stats for category of Malignant:

malignant = summarytools::descr(data2[,c(1:9, 12)] %>% filter(ClassCat == "Malignant") %>% select(-ClassCat), round.digits=3)
kbl(malignant) %>%
  kable_paper() %>%
  scroll_box(width = "800px", height = "300px")
BC BN CT ECSize M MAF NN UCShape UCSize
Mean 5.9748954 7.6276151 7.1882845 5.3263598 2.6025105 5.5857741 5.8577406 6.5606695 6.5774059
Std.Dev 2.2824222 3.1166790 2.4379072 2.4430866 2.5644946 3.1966314 3.3488761 2.5691040 2.7242438
Min 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Q1 4.0000000 5.0000000 5.0000000 3.0000000 1.0000000 3.0000000 3.0000000 4.0000000 4.0000000
Median 7.0000000 10.0000000 8.0000000 5.0000000 1.0000000 5.0000000 6.0000000 6.0000000 6.0000000
Q3 7.0000000 10.0000000 10.0000000 7.0000000 3.0000000 8.0000000 10.0000000 9.0000000 10.0000000
Max 10.0000000 10.0000000 10.0000000 10.0000000 10.0000000 10.0000000 10.0000000 10.0000000 10.0000000
MAD 2.9652000 0.0000000 2.9652000 2.9652000 0.0000000 4.4478000 4.4478000 2.9652000 2.9652000
IQR 3.0000000 5.0000000 5.0000000 3.5000000 2.0000000 5.0000000 6.5000000 5.0000000 6.0000000
CV 0.3820020 0.4086047 0.3391501 0.4586785 0.9853926 0.5722808 0.5717010 0.3915917 0.4141821
Skewness 0.0011900 -0.9140862 -0.4053150 0.5987979 1.7664759 0.0960912 -0.1047398 -0.0356159 -0.0839503
SE.Skewness 0.1574619 0.1574619 0.1574619 0.1574619 0.1574619 0.1574619 0.1574619 0.1574619 0.1574619
Kurtosis -0.9926759 -0.7055967 -0.8940947 -0.6469011 2.0634855 -1.3942210 -1.4693864 -1.2109414 -1.2932736
N.Valid 239.0000000 239.0000000 239.0000000 239.0000000 239.0000000 239.0000000 239.0000000 239.0000000 239.0000000
Pct.Valid 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000

Stats for category of Benign:

benign = summarytools::descr(data2[,c(1:9, 12)] %>% filter(ClassCat == "Benign") %>% select(-ClassCat), round.digits=3)
kbl(benign) %>%
  kable_paper() %>%
  scroll_box(width = "800px", height = "300px")
BC BN CT ECSize M MAF NN UCShape UCSize
Mean 2.0833333 1.3468468 2.9639640 2.1081081 1.0653153 1.3468468 1.2612613 1.4144144 1.3063063
Std.Dev 1.0622994 1.1778482 1.6726612 0.8771115 0.5097378 0.9170883 0.9546059 0.9570314 0.8556575
Min 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Q1 1.0000000 1.0000000 1.0000000 2.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Median 2.0000000 1.0000000 3.0000000 2.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Q3 3.0000000 1.0000000 4.0000000 2.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
Max 7.0000000 10.0000000 8.0000000 10.0000000 8.0000000 10.0000000 8.0000000 8.0000000 9.0000000
MAD 1.4826000 0.0000000 2.9652000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
IQR 2.0000000 0.0000000 3.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
CV 0.5099037 0.8745227 0.5643325 0.4160657 0.4784854 0.6809151 0.7568661 0.6766273 0.6550206
Skewness 1.5589448 4.6393761 0.3368983 4.4155162 10.6165880 4.0114527 4.9442885 3.1654389 4.4191442
SE.Skewness 0.1158575 0.1158575 0.1158575 0.1158575 0.1158575 0.1158575 0.1158575 0.1158575 0.1158575
Kurtosis 4.8128913 25.0286648 -0.7904512 28.1507180 124.6675949 23.1618256 26.6467237 12.5306254 27.3250711
N.Valid 444.0000000 444.0000000 444.0000000 444.0000000 444.0000000 444.0000000 444.0000000 444.0000000 444.0000000
Pct.Valid 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000 100.0000000

Correlation Chart

pairs.panels(data[,c(1:9)], method="pearson",
             hist.col = "#1fbbfa", density=TRUE, ellipses=TRUE, show.points = TRUE,
             pch=1, lm=TRUE, cex.cor=1, smoother=FALSE, stars = TRUE, main="Correlation chart for all potential predictors")

Principal Component Analysis

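# prcomp() has no cor argument (that belongs to princomp()); scale. = TRUE already gives a correlation-based PCA
all_pca = prcomp(data2[,1:9], scale. = TRUE)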
summary(all_pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4289 0.88088 0.73434 0.67796 0.61667 0.54943 0.54259
## Proportion of Variance 0.6555 0.08622 0.05992 0.05107 0.04225 0.03354 0.03271
## Cumulative Proportion  0.6555 0.74172 0.80163 0.85270 0.89496 0.92850 0.96121
##                            PC8     PC9
## Standard deviation     0.51062 0.29729
## Proportion of Variance 0.02897 0.00982
## Cumulative Proportion  0.99018 1.00000
kbl(all_pca$rotation[,1:5]) %>%
  kable_paper()
PC1 PC2 PC3 PC4 PC5
CT -0.3020626 -0.1408005 0.8663725 -0.1078284 0.0803212
UCSize -0.3807930 -0.0466403 -0.0199378 0.2042554 -0.1456529
UCShape -0.3775825 -0.0824225 0.0335109 0.1758656 -0.1083915
MAF -0.3327236 -0.0520944 -0.4126473 -0.4931726 -0.0195690
ECSize -0.3362340 0.1644044 -0.0877425 0.4273836 -0.6366932
BN -0.3350675 -0.2612606 0.0006915 -0.4986177 -0.1247729
BC -0.3457474 -0.2280768 -0.2130718 -0.0130473 0.2276657
NN -0.3355914 0.0339658 -0.1342484 0.4171135 0.6902102
M -0.2302064 0.9055573 0.0804922 -0.2589878 0.1050417
fviz_eig(all_pca, addlabels = TRUE, ylim = c(0, 75), geom = c("bar", "line"), barfill = "pink",  
         barcolor="blue",linecolor = "red", ncp = 10) +
  labs(title = "PCA for all predictors, excluded NAs in BN predictor",
       x = "Principal Components", y = "% of Variance")
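
The percentages in the scree plot follow directly from the standard deviations printed by `summary(all_pca)`: each component's variance is its squared standard deviation, and the proportion is that variance divided by the total. A minimal sketch, with the standard deviations copied by hand from the output above (so results agree only up to rounding):

```r
# Standard deviations of the nine components, copied from the summary output
sdev <- c(2.4289, 0.88088, 0.73434, 0.67796, 0.61667,
          0.54943, 0.54259, 0.51062, 0.29729)

variance <- sdev^2                    # variance carried by each component
prop_var <- variance / sum(variance)  # proportion of variance per component
cum_var  <- cumsum(prop_var)          # cumulative proportion

print(round(100 * prop_var, 2))       # percentages as shown in the scree plot
print(round(cum_var, 4))
```

Because the nine predictors were standardized, the variances sum to approximately 9, one unit per variable.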

all_var = get_pca_var(all_pca)
all_var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

cos2 represents the quality of representation for variables on the factor map. It’s calculated as the squared coordinates:

\[\text{cos2} = \text{coord} \times \text{coord}\]

corrplot(all_var$cos2, is.corr=FALSE)

contrib contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage):

\[\text{contrib} = \frac{\text{cos2} \times 100}{\text{total cos2 of the component}}\]

corrplot(all_var$contrib, is.corr=FALSE)
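
The cos2 and contrib definitions can be checked by hand with base R. The sketch below uses the built-in `USArrests` data only as a stand-in, and it assumes that the variable coordinates are the loadings scaled by the component standard deviations, which is how `get_pca_var()` is understood to derive `$coord` for a `prcomp` object.

```r
# Scaled PCA on a built-in data set (USArrests is only a stand-in here)
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: each loading column multiplied by that component's sdev
# (assumption: this mirrors $coord from get_pca_var() for prcomp objects)
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

cos2    <- coord^2                                   # quality of representation
contrib <- sweep(cos2, 2, colSums(cos2), `/`) * 100  # contribution in percent

print(round(cos2, 3))
print(round(contrib, 1))
```

Two sanity checks follow from the definitions: with all components kept, each variable's cos2 values sum to 1, and the contributions within each component sum to 100.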

Contribution of Variables to Dim1/Dim2:

p1 = fviz_contrib(all_pca, choice = "var", axes=1, fill = "pink", color = "grey", top = 10)
p2 = fviz_contrib(all_pca, choice = "var", axes=2, fill = "skyblue", color = "grey", top = 10)
grid.arrange(p1, p2, ncol = 2)

Scores and loadings in one plot, called a PCA biplot:

fviz_pca_biplot(all_pca, col.ind = data2$ClassCat, col = "black",
                palette = "jco", geom = "point", repel = TRUE,
                legend.title = "Diagnosis", addEllipses = TRUE)

Interactive 3D chart - correlation between UCSize, UCShape and BC:

plot_ly(data2, x=~UCSize, y=~UCShape, z=~BC,
        color=~ClassCat, mode= "markers",type = "scatter3d") %>% 
  layout(title = "UCSize, UCShape and BC correlation by Diagnosis")

Interactive score plot for PC1 and PC2:

pca_coordinates = as.data.frame(all_pca$x) %>% select(PC1, PC2) %>% cbind(data2[,c(1:9, 12)])
plot_ly(data = pca_coordinates, x = ~PC1, y = ~PC2, type = 'scatter',
               mode = 'markers', symbol = ~ClassCat, symbols = c('circle','x','o'),
               text = ~paste("CT: ", CT, '<br>UCSize:', UCSize, '<br>UCShape: ', UCShape, '<br>MAF: ', MAF, '<br>ECSize: ', ECSize),
               color = ~ClassCat, marker = list(size = 10)) %>%
  layout(title = "Score plot for PC1 and PC2")

Best Classification Model Selection

The part below demonstrates selection of the best classification model based on three metrics: accuracy, sensitivity and specificity.

\[\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Population}}\]
\[\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\]
\[\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}\]
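
As a worked example, the three metrics can be computed from the four cells of a confusion matrix. The counts below are illustrative assumptions: they were chosen so that they reproduce the NNET row of the results table further down for a 170-case test set, and are not taken from the original confusion matrix.

```r
# Illustrative confusion-matrix counts (assumption: chosen to match the
# reported NNET rates; not read from the original model output)
tp <- 110  # positives correctly predicted as positive
fn <- 1    # positives wrongly predicted as negative
tn <- 57   # negatives correctly predicted as negative
fp <- 2    # negatives wrongly predicted as positive

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)

print(round(c(Accuracy = accuracy,
              Sensitivity = sensitivity,
              Specificity = specificity), 4))
```

Note that `confusionMatrix()` from caret treats the first factor level of the reference as the positive class unless `positive` is set, so which diagnosis counts as "positive" depends on the level order of `testing$Class`.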

Load the Data and compute variable importance

Variable importance was computed from the Random Forest model.

A list of hyperparameters for the selected models is available at this site.

load("models.RData")
plot(varImp(rf_fit), main="Variable importance computed based on Random Forest model")

Process the Results

library(caret)  # confusionMatrix(), varImp()

# Fitted models, in the same order as the row names assigned below
models = list(xgboost_fit, ada_fit, c50_fit, rf_fit, nb_fit, knn_fit, svm_fit, nnet_fit, gbm_fit, logreg_fit)
results_df = data.frame("Accuracy"=rep(0,length(models)),
                         "Sensitivity"=rep(0,length(models)),
                         "Specificity"=rep(0,length(models)))
# Evaluate every model on the hold-out set (testing is assumed to come from models.RData)
for(i in seq_along(models)) {
  pred_mod = predict(models[[i]], newdata = testing)
  cm = confusionMatrix(data = pred_mod, reference = testing$Class)
  # Look the metrics up by name rather than by position
  results_df[i,1] = cm$overall[["Accuracy"]]
  results_df[i,2] = cm$byClass[["Sensitivity"]]
  results_df[i,3] = cm$byClass[["Specificity"]]
}

rownames(results_df) = c("XGBoost","AdaBoost","c50","RandomForest","NaiveBayes",
                          "Knn","SVM","NNET","GBM","LogReg")

results_df = results_df %>% arrange(desc(Accuracy))

Draw Final Outputs for Model Comparison

dotchart(results_df$Accuracy, labels=row.names(results_df), cex=1.3, main="Accuracy", pch=19)
dotchart(results_df$Sensitivity, labels=row.names(results_df), cex=1.3, main="Sensitivity", pch=19)
dotchart(results_df$Specificity, labels=row.names(results_df), cex=1.3, main="Specificity", pch=19)

results_df$Accuracy = cell_spec(results_df$Accuracy, color = ifelse(results_df$Accuracy > 0.97, "green", "black"))
kbl(results_df, escape = F) %>%
  kable_paper("striped", full_width = F)
Accuracy Sensitivity Specificity
NNET 0.982352941176471 0.990991 0.9661017
AdaBoost 0.976470588235294 0.990991 0.9491525
RandomForest 0.976470588235294 0.990991 0.9491525
c50 0.970588235294118 0.981982 0.9491525
SVM 0.970588235294118 0.990991 0.9322034
NaiveBayes 0.964705882352941 0.981982 0.9322034
LogReg 0.964705882352941 0.990991 0.9152542
XGBoost 0.958823529411765 0.990991 0.8983051
GBM 0.958823529411765 0.990991 0.8983051
Knn 0.764705882352941 1.000000 0.3220339

Conclusion

The main purpose of this presentation was to demonstrate various techniques for basic data analysis.

A significant part consists of classification model selection based on the results of a cross-validation procedure.